Complementary multi-modality molecular self-supervised learning via non-overlapping masking for property prediction

Abstract Self-supervised learning plays an important role in molecular representation learning because labeled molecular data are usually limited in many tasks, such as chemical property prediction and virtual screening. However, most existing molecular pre-training methods focus on one modality of molecular data, and the complementary information of two important modalities, SMILES and graph, is not fully explored. In this study, we propose an effective multi-modality self-supervised learning framework for molecular SMILES and graphs. Specifically, SMILES data and graph data are first tokenized so that they can be processed by a unified Transformer-based backbone network, which is trained by a masked reconstruction strategy. In addition, we introduce a specialized non-overlapping masking strategy to encourage fine-grained interaction between these two modalities. Experimental results show that our framework achieves state-of-the-art performance on a series of molecular property prediction tasks, and a detailed ablation study demonstrates the efficacy of the multi-modality framework and the masking strategy.


Supplementary A Downstream data supplements
The 14 downstream task datasets are sourced from MoleculeNet [34] and can be categorized into physiology (BBBP, Tox21, ToxCast, SIDER, ClinTox), biophysics (BACE, MUV, HIV), physical chemistry (ESOL, FreeSolv, Lipophilicity), and quantum mechanics (QM7, QM8, QM9). All downstream tasks involve small molecules, with at most 140 atoms per molecule. Our pre-training dataset is the small-molecule database ZINC15, with at most 26 atoms per molecule. Physiology focuses on macroscopic life systems, while biophysics employs physical methods to study biological phenomena. Physical chemistry analyzes the principles and laws governing the chemical behaviour of matter systems from a classical physical perspective, whereas quantum mechanics addresses the same questions at the quantum level. Table 1 provides detailed information on these 14 datasets, including task types, evaluation metrics, molecular categories, data sizes, and split types. As shown in Table 1, we employed scaffold splitting for all benchmarks except QM9. Scaffold splitting segregates molecules with different two-dimensional structural frameworks into distinct subsets, offering a more challenging yet practical setup in which test molecules can differ structurally from the training set. For QM9, we used random splitting, following previous research. Our evaluation metrics for regression tasks were RMSE and MAE.
Table 1. The detailed information of all the benchmarks for molecular property prediction used in this work. The benchmarks contain 8 graph classification datasets and 6 graph regression datasets.
The dataset details are as follows: 1. BBBP [1] comprises information concerning whether a compound exhibits the capability to traverse the blood-brain barrier. 2. Tox21 [2] is a publicly accessible database designed to assess the toxicity profiles of various compounds.
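For readers unfamiliar with scaffold splitting, the sketch below shows one plausible way to build such a split with RDKit's Bemis-Murcko scaffolds. It is a minimal illustration only: the greedy group assignment and the 80/10/10 ratios are assumptions and may not match the exact splitting code used in our experiments.

```python
# Minimal scaffold-split sketch (illustrative; not the paper's exact implementation).
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Group molecules by Bemis-Murcko scaffold and assign whole groups to splits."""
    scaffold_to_indices = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        # Molecules sharing the same 2D framework receive the same scaffold SMILES.
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        scaffold_to_indices[scaffold].append(idx)

    # Assign the largest scaffold groups first so the training set stays diverse.
    groups = sorted(scaffold_to_indices.values(), key=len, reverse=True)

    n = len(smiles_list)
    train, valid, test = [], [], []
    for group in groups:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Example usage with a handful of molecules.
if __name__ == "__main__":
    smiles = ["c1ccccc1O", "c1ccccc1N", "CCO", "CCCO", "c1ccncc1", "CC(=O)O"]
    print(scaffold_split(smiles))
```

Because entire scaffold groups are kept in one subset, the test set contains frameworks never seen during training, which is what makes this setup harder than a random split.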

B Experimental settings
Training settings: We train MoleSG for 90k iterations using the AdamW optimizer with a base learning rate of 1e-3 and a warmup factor of 0.1 for the first 30 epochs. We set the masking ratio to 25% for the graph and 15% for SMILES, and the batch size to 32.
Fine-tuning settings: We use the AdamW optimizer with a base learning rate of 1e-3 and different warmup factors for the first 30 epochs. We set a maximum of 150 training epochs, with early stopping applied when the best value on the validation dataset does not improve for more than 20 epochs. As shown in Table 2, we list the detailed training parameters for each downstream task.
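As a concrete illustration of these settings, the sketch below shows one plausible PyTorch configuration: AdamW with a linear warmup from 0.1x the base learning rate over the first 30 epochs, plus patience-based early stopping. The placeholder model, the interpretation of the warmup factor, and the validation criterion are assumptions for illustration, not the released training code.

```python
# Illustrative training-setup sketch (assumed interpretation of the reported hyper-parameters).
import torch

model = torch.nn.Linear(128, 1)           # placeholder for the MoleSG network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_epochs, warmup_factor = 30, 0.1
def lr_lambda(epoch):
    # Linear warmup from warmup_factor * base_lr to base_lr over the first 30 epochs.
    if epoch < warmup_epochs:
        return warmup_factor + (1.0 - warmup_factor) * epoch / warmup_epochs
    return 1.0
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

best_val, patience, bad_epochs = float("inf"), 20, 0
for epoch in range(150):                  # maximum of 150 fine-tuning epochs
    # ... one training epoch over batches of size 32 would go here ...
    val_metric = 0.0                      # placeholder for the validation metric (lower is better)
    scheduler.step()
    if val_metric < best_val:
        best_val, bad_epochs = val_metric, 0
    else:
        bad_epochs += 1
        if bad_epochs > patience:         # early stopping after 20 epochs without improvement
            break
```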

C Competitors
To verify MoleSG's effectiveness, we conduct a thorough performance evaluation, comparing it with supervised and self-supervised learning competitors.
Competitors overview: Supervised competitors include MPNN [15], DMPNN [16], CMPNN [17], CoMPT [18], and GraSeq [29], which are specifically designed for molecules. GraSeq [29] uses a complementary combination of graph neural networks and recurrent neural networks to model two types of molecular inputs. In our evaluation of predictive-based self-supervised learning, we consider several pre-training models. For instance, N-gram [19] assembles node embeddings in short walks and utilizes Random Forest or XGBoost for predicting molecular properties. PretrainGNN [20] and GROVER [21] incorporate both node-level and graph-level knowledge into pretext tasks. MGSSL [22] employs a pretext task involving motif generation, while GEM [23] focuses on geometry-level self-supervised learning strategies for molecular geometry knowledge acquisition. In the realm of contrastive-based models, we include GraphMVP [24], which integrates 3D geometric information into graph self-supervised learning. Additionally, MolCLR [25] applies general graph augmentation methods to molecules. To ensure a fair comparison, we substitute the original GCN and GIN backbones in MolCLR with the CoMPT backbone, resulting in an additional baseline referred to as MolCLR-CoMPT for a comprehensive comparative analysis with our method. DVMP [26] employs a contrastive self-supervised learning approach to obtain knowledge from two views of the same molecule: it extracts SMILES sequence information encoded by Transformers and graph information encoded by graph neural networks (GNNs). For a fair comparison, we replace the feature extraction networks for both SMILES and graphs in DVMP with the same networks used in MoleSG, and the result is reported as DVMP-MoleSG; we also adopt the same mask strategy as used in our approach. Mole-BERT [27] encodes atoms into meaningful discrete values and designs a masked atom model for pre-training. KANO [28] is a molecular contrastive learning method enhanced with knowledge graphs and functional prompts. SMICLR [30] proposes a contrastive learning pre-training method that integrates molecular SMILES and graph modalities.

D Token vocabulary
The Simplified Molecular Input Line Entry System (SMILES) is a linear notation used to represent molecules simply and compactly, categorizing their components into three types: atoms, bonds, and other representations encoding ring closures in the graph. An example of a molecule represented in SMILES notation is shown in Figure 1, where the SMILES string c1cc(F)ccc1Cl is provided alongside its 3D molecular structure. In essence, letters such as C, Cl, and F generally represent atoms, symbols like -, =, and # represent chemical bonds, and numbers denote the adjacent atoms that close rings in the molecule. However, it is worth noting that SMILES does not provide a perfect one-to-one mapping between sequences and molecular structures. For instance, a molecule can have multiple equivalent SMILES representations, such as CCO, OCC, and C(O)C. To address this issue and provide a one-to-one mapping between SMILES and molecules, various standardization (canonicalization) algorithms have been developed to ensure that each molecular structure has a unique representation. In this paper, all SMILES representations used are standardized. Below, we list all the elements found in the SMILES data, along with their corresponding token IDs. Knowing the elements indicated by these IDs facilitates the extraction of atomic representations for our non-overlapping masking strategy.
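The following regex-based tokenizer is a minimal sketch of how a SMILES string can be split into atom, bond, and ring-closure tokens and mapped to integer IDs. The regular expression and the vocabulary construction are illustrative assumptions; they do not reproduce the exact token vocabulary listed in this section.

```python
# Minimal SMILES tokenizer sketch (illustrative; the actual vocabulary is listed in this section).
import re

# Common SMILES token pattern: bracket atoms, two-letter elements, single-letter atoms,
# bonds, branches, dots, %nn two-digit ring closures, and single-digit ring closures.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|[BCNOPSFIbcnops]|=|#|-|\+|\\|/|\(|\)|\.|%\d{2}|\d)"
)

def tokenize(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

def build_vocab(smiles_list):
    # Assign a token ID to every element observed in the data.
    vocab = {}
    for smi in smiles_list:
        for tok in tokenize(smi):
            vocab.setdefault(tok, len(vocab))
    return vocab

print(tokenize("c1cc(F)ccc1Cl"))
# ['c', '1', 'c', 'c', '(', 'F', ')', 'c', 'c', 'c', '1', 'Cl']
print(build_vocab(["c1cc(F)ccc1Cl", "CCO", "C(O)C"]))
```

Distinguishing atom tokens from bond and ring-closure tokens in this way is what allows atom-level masks on the SMILES side to be related to node-level masks on the graph side.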

E Robustness verification
To better estimate performance, we report MoleSG's results for 10 seeds in Table 4 and Table 5.
From the results, we can observe that the results of the ten repeated experiments are close to those of the three repeated experiments, indicating the robustness of our model.

F The selection of encoder on fine-tuning
We have further added new experiments combining the graph and SMILES encoders. We retain the graph encoder, the SMILES encoder, and the backbone during downstream tasks, and add a prediction head behind the backbone. Since the backbone is pre-trained for masked reconstruction, which differs substantially from the downstream tasks, we only try freezing the graph/SMILES encoder rather than the backbone during downstream tasks. We run these new experiments under four different settings (Scratch, Fine-tune_all, SMILES_freeze, and Graph_freeze). The results in Table 6 and Table 7 show that using only the graph encoder is better than combining the two encoders.
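A minimal sketch of these four settings is given below. The module names (graph_encoder, smiles_encoder, backbone, head) are hypothetical stand-ins for the corresponding MoleSG components, and the sketch is not the actual fine-tuning code.

```python
# Illustrative sketch of the four fine-tuning settings (module names are hypothetical).
import torch

class DummyMoleSG(torch.nn.Module):
    # Stand-in with the assumed sub-modules; the real encoders are Transformer-based.
    def __init__(self):
        super().__init__()
        self.graph_encoder = torch.nn.Linear(16, 32)
        self.smiles_encoder = torch.nn.Linear(16, 32)
        self.backbone = torch.nn.Linear(32, 32)
        self.head = torch.nn.Linear(32, 1)

def configure_finetuning(model, setting):
    if setting == "Scratch":
        # Train from scratch: re-initialize every sub-module instead of loading pre-trained weights.
        for module in model.modules():
            if hasattr(module, "reset_parameters"):
                module.reset_parameters()
    elif setting == "Fine-tune_all":
        pass                                   # all pre-trained parameters stay trainable
    elif setting == "SMILES_freeze":
        for p in model.smiles_encoder.parameters():
            p.requires_grad = False            # keep the pre-trained SMILES encoder fixed
    elif setting == "Graph_freeze":
        for p in model.graph_encoder.parameters():
            p.requires_grad = False            # keep the pre-trained graph encoder fixed

    # The backbone is never frozen because its masked-reconstruction pre-training
    # differs substantially from the downstream prediction tasks.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-3)

optimizer = configure_finetuning(DummyMoleSG(), "Graph_freeze")
```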

G Learning efficiency analysis
As shown in Figure 1 and Figure 2 below, multi-modality pre-training slightly speeds up the convergence of downstream tasks compared with single-modality pre-training, showing an advantage over the single-modality method in terms of learning efficiency.

Figure 1. Validation RMSE comparison between models using graph encoders from multi-modality pre-training (Ours) and single-modality pre-training (Single pretrained) on the FreeSolv downstream task.

Figure 2. Training loss comparison between models using graph encoders from multi-modality pre-training (Ours) and single-modality pre-training (Single pretrained) on the FreeSolv downstream task.

Table 2. Training parameters of different downstream tasks.
As shown in Table 3, we list the random seeds used in each downstream task; these seeds are used not only for data partitioning but also for all other randomized parts of the experiments.

Table 3. Random seeds in the 14 downstream tasks.

Table 4. Comparison of MoleSG's performance with three seeds versus ten seeds on classification benchmarks. (Higher values indicate better performance.)

Table 5. Comparison of MoleSG's performance with three seeds versus ten seeds on regression benchmarks. (Lower values indicate better performance.)

Table 6. Ablation experiments on fine-tuning encoder selection for classification benchmarks. The performance of training all models from scratch is shown in the "Scratch" row. The performances of fine-tuning all pre-trained models, freezing only the SMILES encoder, and freezing only the graph encoder are shown in the "Fine-tune_all", "SMILES_freeze", and "Graph_freeze" rows, respectively. (Higher values indicate better performance.)

Table 7. Ablation experiments on fine-tuning encoder selection for regression benchmarks. The performance of training all models from scratch is shown in the "Scratch" row. The performances of fine-tuning all pre-trained models, freezing only the SMILES encoder, and freezing only the graph encoder are shown in the "Fine-tune_all", "SMILES_freeze", and "Graph_freeze" rows, respectively. (Lower values indicate better performance.)